NSF PAR Search | NSF Public Access Repository

Toxicity detection for free

Hu, Zhanhao; Piet, Julien; Zhao, Geng; Jiao, Jiantao; Wagner, David (December 2024, 38th Annual Conference on Neural Information Processing Systems (NeurIPS 2024))

Current LLMs are generally aligned to follow safety requirements and tend to refuse toxic prompts. However, LLMs can fail to refuse toxic prompts or be overcautious and refuse benign examples. In addition, state-of-the-art toxicity detectors have low true positive rates (TPRs) at low false positive rates (FPRs), incurring high costs in real-world applications where toxic examples are rare. In this paper, we introduce Moderation Using LLM Introspection (MULI), which detects toxic prompts using information extracted directly from the LLMs themselves. We find that benign and toxic prompts can be distinguished by the distribution of the first response token's logits. Using this idea, we build a robust detector of toxic prompts using a sparse logistic regression model on the first response token logits. Our scheme outperforms state-of-the-art detectors under multiple metrics.
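The idea in the abstract above, a sparse logistic regression over the first response token's logits, can be sketched on toy data. This is only an illustration under stated assumptions, not the authors' implementation: the logits are synthetic rather than read from an LLM, the `refusal_token` index and all function names are hypothetical, and the L1 penalty is solved with a plain proximal gradient loop chosen for self-containment.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def soft_threshold(x, t):
    # Proximal operator of the L1 norm; it sets small weights exactly to zero.
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0

def fit_sparse_logreg(X, y, l1=0.05, lr=0.1, steps=500):
    # L1-regularized logistic regression via proximal gradient descent.
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_b += err / n
            for j in range(d):
                grad_w[j] += err * xi[j] / n
        # Gradient step on the smooth loss, then shrink toward zero for sparsity.
        w = [soft_threshold(wj - lr * gj, lr * l1) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict_proba(w, b, x):
    # Estimated probability that the prompt behind logit vector x is toxic.
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def make_synthetic_logits(n, vocab_size, refusal_token, rng):
    # ASSUMPTION: synthetic stand-in for first-response-token logits, in which
    # one hypothetical "refusal" token scores higher on toxic prompts.
    X, y = [], []
    for i in range(n):
        toxic = i % 2
        logits = [rng.gauss(0.0, 1.0) for _ in range(vocab_size)]
        logits[refusal_token] += 3.0 if toxic else -3.0
        X.append(logits)
        y.append(toxic)
    return X, y

rng = random.Random(0)
X, y = make_synthetic_logits(n=200, vocab_size=20, refusal_token=7, rng=rng)
w, b = fit_sparse_logreg(X, y)
accuracy = sum((predict_proba(w, b, x) >= 0.5) == bool(t) for x, t in zip(X, y)) / len(y)
nonzero = [j for j, wj in enumerate(w) if wj != 0.0]
```

In a real setting the rows of `X` would come from a model's logits for the first generated token, and the L1 penalty would select a small subset of a vocabulary far larger than the 20 entries used here.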
Full Text Available
Toxicity Detection for Free

Hu, Zhanhao; Piet, Julien; Zhao, Geng; Jiao, Jiantao; Wagner, David (December 2024, Advances in Neural Information Processing Systems)

Full Text Available
Online Learning in Stackelberg Games with an Omniscient Follower

Zhao, Geng; Zhu, Banghua; Jiao, Jiantao; Jordan, Michael (August 2023, Proceedings of Machine Learning Research)

We study the problem of online learning in a two-player decentralized cooperative Stackelberg game. In each round, the leader first takes an action, followed by the follower who takes their action after observing the leader’s move. The goal of the leader is to learn to minimize the cumulative regret based on the history of interactions. Differing from the traditional formulation of repeated Stackelberg games, we assume the follower is omniscient, with full knowledge of the true reward, and that they always best-respond to the leader’s actions. We analyze the sample complexity of regret minimization in this repeated Stackelberg game. We show that depending on the reward structure, the existence of the omniscient follower may change the sample complexity drastically, from constant to exponential, even for linear cooperative Stackelberg games.
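The round structure described above (the leader moves, the omniscient follower best-responds, and the leader accumulates regret) can be sketched for a small finite game. This is a toy illustration under assumptions that are not in the abstract: a shared, noise-free reward table, finite action sets, and a simple explore-then-commit leader. It is not the algorithm analyzed in the paper, and all names are hypothetical.

```python
def best_response(reward, a):
    # Omniscient follower: knows the true reward table and maximizes it
    # given the leader's action a.
    row = reward[a]
    return max(range(len(row)), key=lambda b: row[b])

def play(reward, rounds):
    n_leader = len(reward)
    # Benchmark: the best reward the leader could get with a best-responding follower.
    best_value = max(reward[a][best_response(reward, a)] for a in range(n_leader))
    observed = {}   # leader's record of the reward seen for each action
    regret = 0.0
    history = []
    for t in range(rounds):
        if t < n_leader:
            a = t  # ASSUMPTION: explore each leader action once
        else:
            a = max(observed, key=observed.get)  # then commit to the best one seen
        b = best_response(reward, a)
        r = reward[a][b]  # ASSUMPTION: rewards are observed without noise
        observed[a] = r
        regret += best_value - r
        history.append((a, b, r))
    return regret, history

# Hypothetical shared reward table: rows are leader actions, columns follower actions.
reward = [
    [0.2, 0.5, 0.1],
    [0.9, 0.3, 0.4],
    [0.6, 0.7, 0.0],
]
regret, history = play(reward, rounds=10)
```

With noise-free rewards and a handful of actions the regret stops growing once exploration ends; the abstract's point is that richer reward structures can make the required number of samples far larger.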
Full Text Available
Evolution of microstructure and residual stress for a lead-frame Cu-2.13Fe-0.026P (wt%) alloy

https://doi.org/10.1016/j.jallcom.2023.171383

Cao, Taifeng; Wang, Shaohua; Zhao, Geng; Wu, Xinlong; Liaw, Peter K.; Qiao, Junwei (November 2023, Journal of Alloys and Compounds)

Full Text Available